Group 10: Hamza Siddiqui, Hridyansh Gupta, Maher Thakkar, Manas M Bhat and Parika Rawat
Course: BUDT704
Section: 0502
Date: 5th December 2022
Traffic violations are breaches of the traffic regulations intended to safeguard everyone in the vicinity of vehicles. We are the Traffic Risk Analysis Team, located in College Park, Maryland, and we aim to analyze traffic incidents and violations in the United States to keep the roads safer. We do this by presenting our analysis to police departments and local government authorities so they can see where and why traffic violations and incidents occur. In this report, we analyze traffic violations in Maryland, the majority of which take place in Montgomery County. We acquired the dataset from Kaggle; the whole raw dataset is available at Traffic Violations in Maryland County | Kaggle. It covers traffic violations recorded from 2012 to 2018, with 1,137,349 violations recorded.
In order to structure our analysis to best help the local Montgomery County Police Department, we came up with 5 questions to be answered through our analysis.
Which areas in Maryland produce the most accidents?
Using the longitude and latitude columns in the dataset, we pinpoint where every violation involving an accident occurred in order to identify accident hotspots. This is important for the police department because if most accidents occur within MCPD's jurisdiction, they can flag potentially high-risk traffic areas and deploy appropriate signs and warnings to drivers. They can also increase the number of speed cameras or local patrols in those areas to reduce accidents.
Are there any signs of racial bias in the police department traffic control?
Using the type of violation reporting (camera and radar technology versus human police officers), we can examine whether there is any racial bias in who gets pulled over or cited for a traffic violation. We hypothesize that the technology-based reporting systems will show a near 1:1 ratio of violations between white and non-white drivers. If so, we can then check whether the human reporting system shows different results that would indicate racial bias against non-white drivers. This comparison is reasonable because about 43% of Montgomery County residents are white, so the numbers of white and non-white residents are comparable. This matters because racial bias in a police department is detrimental to the community. From the department's own point of view, given the racial tension in the United States today, it would be damaging to its reputation if officers were pulling over minorities at a disproportionate rate compared to white residents. To keep the community's trust and keep the roads safer, the police need to know of any potential indications of racial bias in their department.
Does a reckless driver get a warning or a citation?
This analysis uses our own discretion for what a reckless driver is:
The driver was involved in an accident.
The driver contributed to an accident.
The driver damaged property with their vehicle.
The driver was under the influence of alcohol.
This analysis can show how effectively the police department enforces traffic laws. A citation is more consequential for a driver than a warning, but if an incident is harmful or dangerous enough, a citation is necessary to keep the roads safer. If citations are being under-issued, the police department should address this in its training.
We will also compare male and female drivers on whether they receive a citation or a warning, to see whether there is any gender bias for the police department to clean up.
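The reckless-driver criteria above map directly onto boolean columns that appear in the dataset ('Accident', 'Contributed To Accident', 'Property Damage', 'Alcohol'). As a minimal sketch of the filter, using a hypothetical two-row sample in place of the real data:

```python
import pandas as pd

# Hypothetical sample rows standing in for the real dataset
sample = pd.DataFrame({
    'Accident':                ['Yes', 'No'],
    'Contributed To Accident': [False, False],
    'Property Damage':         ['No', 'No'],
    'Alcohol':                 ['No', 'No'],
    'Violation Type':          ['Citation', 'Warning'],
})

# A driver counts as "reckless" if any of the four criteria hold
reckless = sample[
    (sample['Accident'] == 'Yes')
    | sample['Contributed To Accident']
    | (sample['Property Damage'] == 'Yes')
    | (sample['Alcohol'] == 'Yes')
]
print(reckless['Violation Type'].value_counts())
```

The same boolean OR would be applied to the full dataframe before tabulating citations versus warnings.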
Which Gender causes the most accidents?
Using the gender column in the dataset, we can analyze which gender causes the most accidents. Similar to the area analysis, this adds more information on how traffic accidents occur: we can see clearly whether one gender has more violations involving accidents than the other. This is important to the police department because knowing whether men or women cause more accidents, and where each gender drives most often, lets them take the same precautions as in the previous analysis, with more patrol officers and cameras.
Predicting whether a given vehicle will cause an accident or not (using ML).
This analysis uses machine learning to predict whether a given type of vehicle will cause an accident or not. Using the dataset and its trends, we can show the police department the likelihood that a vehicle causes an accident. This is important because when the police are flooded with calls, as they can be, and a violation is recorded by a speed camera or other technology, we can estimate the chance that there was an accident to be reported with the violation. We can also use this model to determine whether a certain vehicle can be labeled high-risk, which will make the police department more aware when making routine stops for violations. For high-risk vehicles, we can give drivers warnings in advance of accidents to keep roads safer.
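As a minimal sketch of this approach (illustrative only; the toy data, feature names and values here are hypothetical stand-ins for the real violation records), a random-forest classifier can be trained on encoded vehicle attributes to predict the accident flag:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real violation records
toy = pd.DataFrame({
    'VehicleType': ['Automobile', 'Motorcycle', 'Truck', 'Automobile'] * 25,
    'Year':        [2005, 2012, 1998, 2016] * 25,
    'Accident':    [0, 1, 1, 0] * 25,
})

# One-hot encode the categorical feature so the model can use it
X = pd.get_dummies(toy[['VehicleType', 'Year']])
y = toy['Accident']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f'Test accuracy: {model.score(X_test, y_test):.2f}')
```

The real analysis follows the same shape with the full dataframe and its cleaned columns in place of the toy data.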
We have chosen the Data Analysis part because we were able to go above and beyond in techniques and visualizations. We tried to tell a unique story through our analyses, showing how this project can help the police do a better job of keeping the people of the community safe and secure and decide which additional rules to apply. Our analysis goes beyond the basics: we used a range of techniques to reach our answers, including folium maps, interactive plots, and predictive ML algorithms with an accuracy of 97.4%.
For our analysis, we identified a Traffic Violation dataset on Kaggle (https://www.kaggle.com/datasets/rounak041993/traffic-violations-in-maryland-county), which we will clean, analyze, and derive insights from in order to reach useful conclusions.
#Import the python libraries needed to run the code
import pandas as pd
import numpy as np
from numpy import nan as NA
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
import plotly.graph_objects as go
import folium
from folium import plugins
from folium.plugins import HeatMap
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
%matplotlib inline
#Set options to avoid displaying warnings while slicing dataframe
pd.options.mode.chained_assignment = None
pd.set_option('display.max_rows', 10)
#Import the data and display the first few rows of observation
traffic_violation_df = pd.read_csv('Traffic_Violation.csv', low_memory=False)  # low_memory=False avoids mixed-dtype warnings
traffic_violation_df.head()
| SeqID | Date Of Stop | Time Of Stop | Agency | SubAgency | Description | Location | Latitude | Longitude | Accident | ... | Charge | Article | Contributed To Accident | Race | Gender | Driver City | Driver State | DL State | Arrest Type | Geolocation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fbc324ab-bc8d-4743-ba23-7f9f370005e1 | 08/11/2019 | 20:02:00 | MCP | 2nd District, Bethesda | LEAVING UNATTENDED VEH. W/O STOPPING ENGINE, L... | CORDELL ST @ NORFOLK AVE. | 38.989743 | -77.097770 | No | ... | 21-1101(a) | Transportation Article | False | BLACK | M | SILVER SPRING | MD | MD | A - Marked Patrol | (38.9897433333333, -77.09777) |
| 1 | a6d904ec-d666-4bc3-8984-f37a4b31854d | 08/12/2019 | 13:41:00 | MCP | 2nd District, Bethesda | EXCEEDING POSTED MAXIMUM SPEED LIMIT: 85 MPH I... | NBI270 AT MIDDLEBROOK RD | 39.174110 | -77.246170 | No | ... | 21-801.1 | Transportation Article | False | WHITE | M | SILVER SPRING | MD | MD | A - Marked Patrol | (39.17411, -77.24617) |
| 2 | 54a64f6a-df28-4b65-a335-08883866aa46 | 08/12/2019 | 21:00:00 | MCP | 5th District, Germantown | DRIVING VEH W/ TV-TYPE RECEIVING VIDEO EQUIP T... | MIDDLEBROOK AN 355 | 39.182015 | -77.238221 | No | ... | 21-1129 | Transportation Article | False | BLACK | M | GAITHERSBURG | MD | MD | A - Marked Patrol | (39.1820155, -77.2382213333333) |
| 3 | cf5479b6-9bc7-4216-a7b2-99e57ae932af | 08/12/2019 | 21:43:00 | MCP | 5th District, Germantown | DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGI... | GERMANTOWN RD AND ALE HOUSE | 39.160508 | -77.284023 | No | ... | 13-401(h) | Transportation Article | False | BLACK | M | GERMANTOWN | MD | MD | A - Marked Patrol | (39.1605076666667, -77.284023) |
| 4 | 5601ca35-8ee7-4f8e-9208-d89cde96d469 | 08/12/2019 | 21:30:00 | MCP | 2nd District, Bethesda | FAILURE OF LICENSEE TO NOTIFY ADMINISTRATION O... | EASTWEST/ 355 | 38.984247 | -77.090548 | No | ... | 16-116(a) | Transportation Article | False | BLACK | M | SILVER SPRING | MD | MD | A - Marked Patrol | (38.9842466666667, -77.0905483333333) |
5 rows × 43 columns
From the above result, we see that our data set has 43 attributes. However, we do not need all of them for our analysis, so some can be excluded. Also, to make the data set easier to understand, let's reorder the columns in accordance with the analysis we will be performing on this data.
#Retain only the columns needed for our analysis
traffic_violation_df = traffic_violation_df[['SeqID', 'Make', 'Model', 'VehicleType', 'Race', 'Accident','Fatal','Description', 'Gender', 'Date Of Stop','State', 'Year', 'Time Of Stop','Violation Type', 'DL State', 'Driver State', 'Personal Injury', 'Property Damage', 'Alcohol', 'Latitude', 'Longitude', 'Arrest Type']]
Each observation in our dataset contains a unique ID, the SeqID. This column can be set as the index of our dataframe and used to identify individual rows.
#Set an index for our data
traffic_violation_df.set_index('SeqID').head(2)
| Make | Model | VehicleType | Race | Accident | Fatal | Description | Gender | Date Of Stop | State | ... | Time Of Stop | Violation Type | DL State | Driver State | Personal Injury | Property Damage | Alcohol | Latitude | Longitude | Arrest Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SeqID | |||||||||||||||||||||
| fbc324ab-bc8d-4743-ba23-7f9f370005e1 | TOYOTA | CAMRY | 02 - Automobile | BLACK | No | No | LEAVING UNATTENDED VEH. W/O STOPPING ENGINE, L... | M | 08/11/2019 | MD | ... | 20:02:00 | Citation | MD | MD | No | No | No | 38.989743 | -77.09777 | A - Marked Patrol |
| a6d904ec-d666-4bc3-8984-f37a4b31854d | HONDA | CIVIC | 02 - Automobile | WHITE | No | No | EXCEEDING POSTED MAXIMUM SPEED LIMIT: 85 MPH I... | M | 08/12/2019 | MD | ... | 13:41:00 | Citation | MD | MD | No | No | No | 39.174110 | -77.24617 | A - Marked Patrol |
2 rows × 21 columns
Now, let's check how many observations we have for each of the 22 attributes we have retained in our dataset.
#Determine the number of rows in our dataset
print('We have ' + str(len(traffic_violation_df))+' rows of data')
We have 1811977 rows of data
Now that we have the data set we require, let's proceed further and check if all the attributes hold valid observations or not.
#Calculate the number of rows for each variable which have a non-null value
traffic_violation_df.notnull().sum()
SeqID 1811977
Make 1811910
Model 1811765
VehicleType 1811977
Race 1811977
...
Property Damage 1811977
Alcohol 1811977
Latitude 1811977
Longitude 1811977
Arrest Type 1811977
Length: 22, dtype: int64
From the above result, we see a difference in the number of valid observations across a few attributes. This could be due to invalid or corrupted observations, duplicated data, mislabeled columns, and so on. If not corrected, any of this can lead to an imprecise and misleading analysis. Hence, it is crucial that we establish a proper data cleaning process, which will increase the quality of our data.
Data cleaning can be done using different techniques depending on the problem we face. To identify the right one, we need to explore each column individually.
Let's begin!
Step 1: Let's explore the manufacturer of the vehicles involved in the violations
#Count the number of times each Vehicle make is involved in a violation
traffic_violation_df['Make'].value_counts()
TOYOTA 211188
HONDA 199886
FORD 166649
TOYT 99202
NISSAN 98115
...
NEW 1
SABUA 1
EXPRESS 1
HYUNDAQI 1
JYUDAI 1
Name: Make, Length: 4457, dtype: int64
From the generated output, we can see several misspellings in our data set, such as 'TOYOTA' having been entered as 'TOYT' and 'HYUNDAI' as 'HYUNDAQI'. Although it is not possible to go through all the different brands (or misspellings), we can generate a list of the top 60 brands by violation count and then identify the names that have been misspelt.
#Display a list of top 60 vehicle makes involved in a violation
traffic_violation_df['Make'].value_counts().head(60)
TOYOTA 211188
HONDA 199886
FORD 166649
TOYT 99202
NISSAN 98115
...
MINI 3581
LINC 3443
SUZUKI 3107
BUIC 3030
INFINITY 2954
Name: Make, Length: 60, dtype: int64
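Misspellings like these can also be surfaced programmatically rather than by eye. As a sketch (the brand list here is a hypothetical subset, not the full set used in our cleaning), Python's standard-library difflib can match a suspect string to its closest known brand:

```python
from difflib import get_close_matches

# Hypothetical subset of known-good brand names
known_brands = ['TOYOTA', 'HONDA', 'NISSAN', 'HYUNDAI', 'CHEVROLET']

# Suspect values pulled from the Make column
suspects = ['TOYT', 'HYUNDAQI', 'CHEVORLET']

for s in suspects:
    match = get_close_matches(s, known_brands, n=1, cutoff=0.6)
    print(s, '->', match[0] if match else 'no close match')
```

This kind of fuzzy matching helps find candidates, but each suggested correction still deserves a manual check before replacing.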
Now that we know which brand names have been misspelt, we can go ahead and replace them with the actual brand names.
#Replace the top few mis-spelt brand names using a mapping of correct name -> misspellings
make_corrections = {
    'TOYOTA': ['TOYT','TYOTA','T0YOTA','TOY0TA','T0Y0TA','TOYOT','TOYOY','TOYATA','TOYTA','TOYOA','TOYA','TPYOTA','TOYO','TOYOTAA','TOY'],
    'HONDA': ['HOND','HINDA','HODNA','HYUNDA'],
    'NISSAN': ['NISS','NISSIAN'],
    'CHEVROLET': ['CHEV','CHEVY','CHEVORLET'],
    'HYUNDAI': ['HYUN','HYUNDAQI','HYUND'],
    'MERCEDES': ['MERC','MERZ','MERCEDES BENZ','MERCEDEZ','MER'],
    'VOLKSWAGEN': ['VW','VOLKS','VOLK','VOLKSWAGON','VOLKSWAGAN'],
    'MAZDA': ['MAZD'],
    'VOLVO': ['VOLV'],
    'LEXUS': ['LEXS','LEXU','LEX'],
    'SUBARU': ['SUBA'],
    'CADILLAC': ['CADI'],
    'MITSUBISHI': ['MITS'],
    'INFINITI': ['INFI','INFINITY'],
    'CHRYSLER': ['CHRY','CHRYS'],
    'DODGE': ['DODG'],
    'ACURA': ['ACUR'],
    'PONTIAC': ['PONT'],
    'LINCOLN': ['LINC'],
    'BUICK': ['BUIC','BUIK'],
}
for correct, misspellings in make_corrections.items():
    traffic_violation_df['Make'].replace(to_replace=misspellings, value=correct, inplace=True)
traffic_violation_df['Make'].value_counts().head(10)
TOYOTA       318281
HONDA        267531
FORD         166649
NISSAN       138556
CHEVROLET    133656
HYUNDAI       62838
DODGE         59431
ACURA         55292
MERCEDES      53232
BMW           50442
Name: Make, dtype: int64
We can see a considerable increase in the counts for the major brands after cleaning this field; those violations would otherwise have been categorized under different brands altogether.
Step 2: Let's explore the race of the driver mentioned in the dataset
We can begin by checking the count of each unique race in this field.
#Calculate the count of each Race in our dataset
traffic_violation_df['Race'].value_counts()
WHITE              622462
BLACK              576774
HISPANIC           398309
OTHER              107749
ASIAN              103366
NATIVE AMERICAN      3317
Name: Race, dtype: int64
Apart from the catch-all 'OTHER' group, all the racial groupings are accurately labeled and identifiable. This column does not need to be cleaned because there are no NaN values and every row in this column is properly identifiable.
Step 3: Let's explore the gender of the driver involved in the violations
#Calculate the count of each Gender in our dataset
traffic_violation_df['Gender'].value_counts()
M    1218321
F     590975
U       2681
Name: Gender, dtype: int64
Here, M signifies male, F signifies female, and U signifies unknown. Since all the gender codes are valid, no cleaning is required for this column.
Step 4: Let's explore the DL State column, which records the state in which the traffic violator's driving license was issued
Before we proceed further with this column, let's increase the number of rows pandas displays so we can view more rows at the same time.
#Expanding the output display to see the entire data of columns
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)
Now, let's find the number of unique states and the number of observations in each
# Finding the total number of unique DL State names
num_dl_states = traffic_violation_df['DL State'].nunique()
print(f'The total number of states in DL State column: {num_dl_states}')
#Displaying the count of each DL state
traffic_violation_df['DL State'].value_counts()
The total number of states in DL State column: 71
MD    1576139
DC      60868
VA      59940
XX      25827
PA      10924
FL      10145
NY       8118
NC       6112
CA       5922
TX       4699
WV       4116
GA       4005
NJ       3844
MA       2768
OH       2275
DE       1966
IL       1959
SC       1795
WA       1431
MI       1368
AZ       1288
CT       1270
TN       1223
CO       1213
US        842
IN        768
AL        714
MO        698
LA        672
WI        481
MN        475
MS        455
NV        439
NM        426
KY        408
ME        406
OK        387
UT        382
RI        371
OR        338
NH        311
KS        294
VI        293
HI        271
IA        262
ON        240
AK        225
AR        215
ND        193
MT        175
ID        163
NE        155
VT        155
PR        123
MB        102
IT         74
SD         66
AB         44
WY         42
NB         34
QC         33
SK         26
BC         17
GU         16
PE         12
AS          9
NS          7
PQ          6
MH          5
NF          2
YT          1
Name: DL State, dtype: int64
From the above result, we see that the 'DL State' column contains 71 different values, including several invalid state abbreviations such as 'AB', 'BC', 'IT', 'MB', 'MH', 'NB', 'NF', 'NS', 'ON', 'PE', 'PQ', 'QC', 'SK', 'US' and 'XX'.
Now we have 2 options:
(a) To replace the incorrect state abbreviations with the valid ones.
(b) To replace the incorrect state abbreviations with 'XX'.
However, we'll proceed with option (b) because we have no basis for predicting which valid abbreviation an incorrect one corresponds to. For example, 'AB' is an incorrect abbreviation, and if we were to replace it with a valid one, we would not know which to choose: 'AK', 'AL', 'AR', 'AS' or 'AZ'. Since many valid abbreviations start with the same letter, we cannot map an invalid abbreviation to a valid one with any confidence.
Thus, we choose to replace the incorrect state abbreviations with 'XX'.
Source: https://www.faa.gov/air_traffic/publications/atpubs/cnt_html/appendix_a.html to get valid state name abbreviations
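An alternative to blacklisting the known-bad codes is to whitelist the valid ones, which also catches any invalid code we did not spot by eye. A sketch (the valid_codes set and sample values here are abbreviated, hypothetical stand-ins; the full list would come from the FAA source above):

```python
import pandas as pd

# Abbreviated whitelist for illustration; the real set would hold all 50
# states, DC, and the insular areas (AS, GU, MP, PR, VI, UM)
valid_codes = {'MD', 'DC', 'VA', 'PA', 'NY'}

# Hypothetical sample of DL State values
dl_state = pd.Series(['MD', 'AB', 'VA', 'QC', 'YT'])

# Anything not on the whitelist becomes 'XX'
cleaned = dl_state.where(dl_state.isin(valid_codes), 'XX')
print(cleaned.tolist())  # ['MD', 'XX', 'VA', 'XX', 'XX']
```

The whitelist approach is more robust as the data grows, since new invalid codes are handled automatically.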
Now that we know which state abbreviations in the DL State column are invalid, we can go ahead and replace them with 'XX'.
#Replace mis-spelt state names with XX
traffic_violation_df['DL State'].replace(to_replace=['AB','BC','IT', 'MB','MH','NB','NF','NS','ON','PE','PQ','QC','SK','US'], value="XX", inplace = True)
We can count the number of states post replacement to verify if there is a reduction in the number.
# Finding the total number of unique DL State names
valid_num_dl_states = traffic_violation_df['DL State'].nunique()
print(f'The total number of valid states in DL State column: {valid_num_dl_states}')
The total number of valid states in DL State column: 57
Therefore, after cleaning the 'DL State' column, we are left with the 50 valid U.S. states, the insular areas that appear in the data (AS, GU, PR, VI), Washington, DC (which is a district, not a state), our 'XX' placeholder, and a stray 'YT' (Yukon) with a single observation.
Step 5: Let's explore the Driver State column, which signifies the state of the driver's home address.
This is similar to the DL State column we just cleaned, so we can likewise replace the invalid state names with 'XX'.
#Replace mis-spelt state names with XX
traffic_violation_df['Driver State'].replace(to_replace=['AB','BC','IT', 'MB','NB','NF','NS','ON','PE','PQ','QC','SK','US'], value="XX", inplace = True)
#Displaying the count of each Driver state
traffic_violation_df['Driver State'].value_counts().sort_index(ascending=True)
AK        103
AL        500
AR        136
AZ        581
CA       3353
CO        680
CT        896
DC      59933
DE       1628
FL       6320
GA       2486
GU          7
HI        139
IA        108
ID         76
IL       1204
IN        564
KS        165
KY        291
LA        414
MA       1808
MD    1635281
ME        236
MI        864
MN        281
MO        433
MS        322
MT         88
NC       4268
ND        145
NE         94
NH        200
NJ       2842
NM        285
NV        265
NY       5592
OH       1591
OK        270
OR        182
PA       9202
PR         34
RI        232
SC       1122
SD         41
TN        730
TX       2724
UT        224
VA      56366
VI          8
VT        104
WA        867
WI        290
WV       3956
WY         35
XX       1400
Name: Driver State, dtype: int64
From the above list, we can observe that the state names have been cleaned; the only invalid value remaining is 'XX', which we shall retain for our analysis.
Step 6: Let's explore the Location, Latitude and Longitude of the traffic violation
Latitude and longitude are geographical coordinates on the Earth: latitude measures degrees north or south of the equator, while longitude measures degrees east or west of the Prime Meridian. In this data set, latitude and longitude record the coordinates of every recorded traffic violation.
The latitude and longitude data needs some cleaning, as some coordinates fall outside the range covered by the state of Maryland. Since we cannot interpret locations in order to impute data, we need to filter out the bad data points by dropping them. We classify valid coordinates as those with latitudes between 37 and 40 degrees and longitudes between -82 and -75 degrees. These ranges are based on https://www.mapsofworld.com/usa/states/maryland/lat-long.html, which lists the latitude and longitude of major cities and locations in Maryland.
There are also NaN values for latitude and longitude, meaning missing data. Again, since we cannot justifiably impute data into these columns, the only options are to drop those rows or leave them. Since the NaNs amount to only about 7.9% of the data, we decided to drop them as well, which the range filter below does in one step per column.
#Set the display options back to default
pd.set_option('display.max_rows', 10)
pd.set_option('display.width', 80)
#Converting latitude and longitude to the required float format
traffic_violation_df['Latitude'] = traffic_violation_df['Latitude'].astype(float)
traffic_violation_df['Longitude'] = traffic_violation_df['Longitude'].astype(float)
#Filtering out the Latitude and Longitude datapoints that do not fall within Marylands range
traffic_violation_df= traffic_violation_df[(traffic_violation_df['Latitude']>37) & (traffic_violation_df['Latitude']<40)]
traffic_violation_df = traffic_violation_df[(traffic_violation_df['Longitude'] > -82) & (traffic_violation_df['Longitude'] < -75)]
Now that we have explored the main attributes required for our data analysis, we still feel it can be improved by adding some more meaningful data.
Step 7: Use of dummy variables for categorical variables
Using a dummy variable instead of text in categorical variables such as 'Accident', 'Personal Injury', 'Property Damage' and 'Alcohol' makes it easier to calculate totals. Hence, we replace every value of 'No' with 0 and every value of 'Yes' with 1.
#Set categorical variables
no = 0
yes = 1
#Assign categorical variables for each column.
traffic_violation_df['Accident']=traffic_violation_df['Accident'].map(lambda x: yes if x=='Yes' else no)
traffic_violation_df['Personal Injury']=traffic_violation_df['Personal Injury'].map(lambda x: yes if x =='Yes' else no)
traffic_violation_df['Property Damage']=traffic_violation_df['Property Damage'].map(lambda x: yes if x =='Yes' else no)
traffic_violation_df['Alcohol']=traffic_violation_df['Alcohol'].map(lambda x: yes if x =='Yes' else no)
traffic_violation_df.head(2)
| SeqID | Make | Model | VehicleType | Race | Accident | Fatal | Description | Gender | Date Of Stop | ... | Time Of Stop | Violation Type | DL State | Driver State | Personal Injury | Property Damage | Alcohol | Latitude | Longitude | Arrest Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fbc324ab-bc8d-4743-ba23-7f9f370005e1 | TOYOTA | CAMRY | 02 - Automobile | BLACK | 0 | No | LEAVING UNATTENDED VEH. W/O STOPPING ENGINE, L... | M | 08/11/2019 | ... | 20:02:00 | Citation | MD | MD | 0 | 0 | 0 | 38.989743 | -77.09777 | A - Marked Patrol |
| 1 | a6d904ec-d666-4bc3-8984-f37a4b31854d | HONDA | CIVIC | 02 - Automobile | WHITE | 0 | No | EXCEEDING POSTED MAXIMUM SPEED LIMIT: 85 MPH I... | M | 08/12/2019 | ... | 13:41:00 | Citation | MD | MD | 0 | 0 | 0 | 39.174110 | -77.24617 | A - Marked Patrol |
2 rows × 22 columns
Analysis 1: Which cities in the state of Maryland contribute the most accidents?
We created a heatmap depicting the density of accidents in the state of Maryland to help the police department determine where to focus their accident-reduction efforts. The color code indicates the number of accidents: green for low-risk regions, yellow for medium-risk regions, and red for high-risk regions. [1]
# Using folium to create a heat map and entering location coordinates of Maryland State
traffic_violation_df_map = folium.Map(location=[39.045753, -76.641273],
zoom_start = 10)
# Need to convert the coordinate columns into float (without filtering the dataframe itself)
traffic_violation_df['Latitude'] = traffic_violation_df['Latitude'].astype(float)
traffic_violation_df['Longitude'] = traffic_violation_df['Longitude'].astype(float)
# Filtering the Dataframe for rows, then columns, then remove NaNs
heat_df = traffic_violation_df[traffic_violation_df['Accident']==1][['Latitude', 'Longitude']]
heat_df = heat_df.dropna(axis=0, subset=['Latitude','Longitude'])
# Using a list comprehension to make a list of lists
heat_data = [[row['Latitude'],row['Longitude']] for index, row in heat_df.iterrows()]
# Plotting it on the map
HeatMap(heat_data,min_opacity=0.2).add_to(traffic_violation_df_map)
# Displaying the map
traffic_violation_df_map
Observations from the above visualization: (i) Glenmont accounts for the largest number of accidents.
Analysis 2: Is there any racial bias in Maryland traffic police?
One question that came up while investigating the dataset was whether racial bias shows up in the number of traffic violations by race. To test this, we focused on two columns: the Arrest Type column and the Race column. First we cleaned the Arrest Type column by eliminating the unnecessary letter codes for each arrest type; then we put each arrest type into one of two groups, human stops and technology stops. Each violation was recorded either by an actual police officer or other person of authority conducting the stop, or by a camera, radar, or some other form of technology. [2]
For the human-recorded violations, we used:
'Marked Patrol', 'Foot Patrol', 'Unmarked Patrol', 'Motorcycle', 'Marked (Off-Duty)', 'Mounted Patrol', 'Unmarked (Off-Duty)'
For the technology-recorded violations, we used:
'Marked Laser', 'Unmarked Laser', 'Marked Stationary Radar', 'Unmarked Stationary Radar', 'Marked Moving Radar (Moving)', 'Unmarked Moving Radar (Moving)', 'Marked VASCAR', 'Unmarked VASCAR', 'License Plate Recognition'
We classified 'Aircraft Assist' as neither, since assistance implies it is not the only mechanism recording the violation.
The reason for dividing the violations into human and technology groups is to use technology as a baseline for comparison: technology cannot exhibit racial bias unless bias is programmed into it, which is highly unlikely in traffic enforcement. So we hypothesized that the ratio of non-white to white traffic violations from technology stops would be close to 1.
We also hypothesized that non-white drivers (Hispanic, Black, Asian, and Native American) would be targeted more often during traffic stops, yielding a higher non-white to white ratio in the human-recorded violations.
#Check the unique arrest types in our dataset
traffic_violation_df['Arrest Type'].unique()
array(['A - Marked Patrol', 'L - Motorcycle', 'Q - Marked Laser',
'I - Marked Moving Radar (Moving)', 'B - Unmarked Patrol',
'F - Unmarked Stationary Radar', 'R - Unmarked Laser',
'G - Marked Moving Radar (Stationary)',
'E - Marked Stationary Radar', 'O - Foot Patrol',
'H - Unmarked Moving Radar (Stationary)', 'M - Marked (Off-Duty)',
'J - Unmarked Moving Radar (Moving)', 'N - Unmarked (Off-Duty)',
'S - License Plate Recognition', 'C - Marked VASCAR',
'P - Mounted Patrol', 'D - Unmarked VASCAR', 'K - Aircraft Assist'],
dtype=object)
#Clean the data for letter codes and whitespace
traffic_violation_df['Arrest Type'] = traffic_violation_df['Arrest Type'].str.replace(r'\D\s-\s', '', regex=True)
#Classify arrest types as human or tech
human_check = ['Marked Patrol','Foot Patrol','Unmarked Patrol','Motorcycle','Marked (Off-Duty)','Mounted Patrol','Unmarked (Off-Duty)']
tech_check = ['Marked Laser','Marked Stationary Radar','Unmarked Laser','License Plate Recognition','Unmarked VASCAR','Marked Moving Radar (Moving)','Unmarked Stationary Radar','Marked VASCAR','Unmarked Moving Radar (Moving)']
#Create different dataframes for those checked by humans and tech
human_check_df = traffic_violation_df[traffic_violation_df['Arrest Type'].isin(human_check)]
tech_check_df = traffic_violation_df[traffic_violation_df['Arrest Type'].isin(tech_check)]
#Display the unique races
traffic_violation_df.Race.unique()
array(['BLACK', 'WHITE', 'HISPANIC', 'OTHER', 'ASIAN', 'NATIVE AMERICAN'],
dtype=object)
#Classify the non-white races and calculate the numbers of pullovers
non_white_races = ['BLACK','HISPANIC','ASIAN','NATIVE AMERICAN']
print(f'Non white pullovers by tech {tech_check_df.Race.isin(non_white_races).sum()}')
print(f'Non white pullovers by human {human_check_df.Race.isin(non_white_races).sum()}')
print(f'White pullovers by tech {(tech_check_df.Race=="WHITE").sum()}')
print(f'White pullovers by human {(human_check_df.Race=="WHITE").sum()}')
Non white pullovers by tech 99640
Non white pullovers by human 901066
White pullovers by tech 89078
White pullovers by human 480048
#Calculate the ratio of pullovers for different races
non_white_tech = tech_check_df.Race.isin(non_white_races).sum()
non_white_human = human_check_df.Race.isin(non_white_races).sum()
white_tech = (tech_check_df.Race=="WHITE").sum()
white_human = (human_check_df.Race=="WHITE").sum()
tech_others_white_ratio = (non_white_tech/white_tech)
human_others_white_ratio = (non_white_human/white_human)
#Display the ratio results
print(f'The ratio of non-white traffic violations to white recorded by humans is {human_others_white_ratio:.2f}, which is nearly 2:1.')
print(f'The ratio of non-white traffic violations to white recorded by technology is {tech_others_white_ratio:.2f}, which is nearly 1:1, making it even.')
The ratio of non-white traffic violations to white recorded by humans is 1.88, which is nearly 2:1.
The ratio of non-white traffic violations to white recorded by technology is 1.12, which is nearly 1:1, making it even.
#Display the results in a tabular form
racial_bias_df = pd.DataFrame({'Arrest Type':['Tech','Tech', 'Human','Human'], 'Race':['White','Non-White','White','Non-White'],'Number of Violations': [white_tech,non_white_tech,white_human,non_white_human]}, index = [1,2,3,4])
racial_bias_df
| | Arrest Type | Race | Number of Violations |
|---|---|---|---|
| 1 | Tech | White | 89078 |
| 2 | Tech | Non-White | 99640 |
| 3 | Human | White | 480048 |
| 4 | Human | Non-White | 901066 |
#Plot a graph of pullovers by tech and human for people of different races
import plotly.graph_objects as go
fig = go.Figure(data=[
go.Bar(name='White', x=racial_bias_df[racial_bias_df['Race'] == 'White']['Arrest Type'], y=racial_bias_df[racial_bias_df['Race'] == 'White']['Number of Violations']),
go.Bar(name='Non-white', x=racial_bias_df[racial_bias_df['Race'] == 'Non-White']['Arrest Type'], y=racial_bias_df[racial_bias_df['Race'] == 'Non-White']['Number of Violations']),
])
fig.update_layout(
title= "Human VS Tech Racial Profiling",
xaxis_title="Arrest Type",
yaxis_title="Count",
legend_title="Legends",
font=dict(
family="Courier New, monospace",
size=18,
color="RebeccaPurple"
))
fig.show()
Our results show that technology-recorded stops yielded a non-white to white ratio of 1.12, close to the 1:1 split we hypothesized. Human-recorded stops showed a ratio of 1.88, nearly 2:1 non-white to white. The graph shows the raw counts: the tech arrest types yielded nearly even results, while human recording produced far more stops of non-white drivers than white drivers. Based on this dataset, we can therefore conclude there are signs of racial bias in traffic violations in Maryland.
We can infer that police officers in Maryland feel the need to stop non-white drivers rather than white drivers due to personal bias. We can't say that every traffic police officer in the area is 'racist' but we can infer clear bias from the group as a whole.
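To put this comparison on firmer statistical footing, one could run a chi-square test of independence on the counts above; a minimal pure-Python sketch (the table values are the counts reported earlier):

```python
# Contingency table built from the counts reported above:
# rows = arrest type (tech, human), columns = (white, non-white)
table = [[89078, 99640],
         [480048, 901066]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (table[i][j] - expected) ** 2 / expected

print(f'chi-square statistic: {chi2:.1f}')
```

With one degree of freedom, any statistic above roughly 3.84 is significant at the 5% level; at these sample sizes the statistic is enormous, supporting the claim that the white/non-white split genuinely differs between human and tech recording.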
Analysis 3: Does a reckless driver get a warning or a citation, and what is the warning rate for males versus females? [3]
Let's scrutinize whether a reckless driver receives a warning or a citation, and how the warning rate differs between male and female drivers.
A reckless driver can be identified by at least one of the following events: an accident, personal injury, property damage, or alcohol involvement.
# Calculate the count of reckless drivers in the dataset
reckless_drivers_df = traffic_violation_df.loc[(traffic_violation_df['Accident'] == 1) | (traffic_violation_df['Personal Injury'] == 1) | (traffic_violation_df['Property Damage'] == 1) | (traffic_violation_df['Alcohol'] == 1)]
count_reckless_drivers = len(reckless_drivers_df)
# Calculate the count of reckless drivers that received warning
warning_rd_df = reckless_drivers_df.where(reckless_drivers_df['Violation Type'] == 'Warning').dropna()
reckless_drivers_warning = len(warning_rd_df)
# FEMALE
# Count of female reckless drivers
female_reckless_drivers = len(reckless_drivers_df.where(reckless_drivers_df['Gender'] == 'F').dropna())
# Count of female reckless drivers that received warnings
female_warnings = len(warning_rd_df.where(warning_rd_df['Gender'] == 'F').dropna())
# Count of female reckless drivers that received citations
female_citations = female_reckless_drivers - female_warnings
# Count of male reckless drivers
male_reckless_drivers = len(reckless_drivers_df.where(reckless_drivers_df['Gender'] == 'M').dropna())
# Count of male reckless drivers that received warnings
male_warnings = len(warning_rd_df.where(warning_rd_df['Gender'] == 'M').dropna())
# Count of male reckless drivers that received citations
male_citations = male_reckless_drivers - male_warnings
print(f'Out of {len(traffic_violation_df)} violations, there are {count_reckless_drivers} reckless drivers and out of these, {reckless_drivers_warning} received a warning instead of a citation')
print(f'If we look further into it, we see that the female warning rate is {(female_warnings/female_reckless_drivers)*100:.2f}% whereas male warning rate is {(male_warnings/male_reckless_drivers)*100:.2f}%')
Out of 1685941 violations, there are 73306 reckless drivers and out of these, 3179 received a warning instead of a citation
If we look further into it, we see that the female warning rate is 9.53% whereas male warning rate is 6.40%
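The per-gender warning rates above can also be computed in a single pass with groupby; a toy sketch on made-up rows (not the project's dataframe), using the same two column names:

```python
import pandas as pd

# Hypothetical mini-sample with the same two columns used above
demo = pd.DataFrame({
    'Gender':         ['M', 'F', 'M', 'F', 'M', 'F'],
    'Violation Type': ['Citation', 'Warning', 'Citation',
                       'Warning', 'Warning', 'Citation'],
})

# Share of each gender's violations that were warnings, as a percentage
warning_rate = (demo['Violation Type'].eq('Warning')
                    .groupby(demo['Gender']).mean() * 100)
print(warning_rate)
```

Taking the mean of a boolean column gives the fraction of True values, so this yields one warning rate per gender without building separate dataframes.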
# Let's visualize the result
from matplotlib import rcParams
df = pd.DataFrame(dict(
x=['Male', 'Female'],
y1=[male_citations, female_citations],
y2=[male_warnings, female_warnings]
))
bar_plot1 = sns.barplot(x='x', y='y1', data=df, label="Citations", color="c")
bar_plot2 = sns.barplot(x='x', y='y2', data=df, label="Warnings", color="m")
bar_plot1.set(xlabel='GENDER', ylabel='COUNT')
plt.title("Reckless Drivers: Warning vs Citation")
sns.set(style="darkgrid")
rcParams['figure.figsize'] = 5,5
plt.legend()
plt.show()
Analysis 4: Who are the worst drivers based on race and gender?
We analyze whether there is any correlation between a driver's race and gender and accident involvement, and consider possible reasons for any correlation found.
#Create a new dataframe where drivers are male and involved in accidents
df_new_male_accident = traffic_violation_df[(traffic_violation_df['Gender'] == 'M')&(traffic_violation_df['Accident']==1)]
#Create a new dataframe where drivers are female and involved in accidents
df_new_female_accident = traffic_violation_df[(traffic_violation_df['Gender'] == 'F')&(traffic_violation_df['Accident']==1)]
White is the most common race in the dataset and male the most common gender, so we start with the dataframe of male drivers involved in accidents and find how many of those drivers are white compared to each other race.
#Count male drivers involved in accidents by race and convert to percentages
no_of_rows_male_white=len(df_new_male_accident.index)
count_male_white=(df_new_male_accident['Race']=='WHITE').sum()
male_white_percentage=(count_male_white/no_of_rows_male_white)*100
print(f'Frequency of male gender who is of white race to cause an accident is {count_male_white} which is {male_white_percentage:.2f}%')
count_male_black=(df_new_male_accident['Race']=='BLACK').sum()
male_black_percentage=(count_male_black/no_of_rows_male_white)*100
print(f'Frequency of male gender who is of black race to cause an accident is {count_male_black} which is {male_black_percentage:.2f}%')
count_male_hispanic=(df_new_male_accident['Race']=='HISPANIC').sum()
male_hispanic_percentage=(count_male_hispanic/no_of_rows_male_white)*100
print(f'Frequency of male gender who is of hispanic race to cause an accident is {count_male_hispanic} which is {male_hispanic_percentage:.2f}%')
count_male_other=(df_new_male_accident['Race']=='OTHER').sum()
male_other_percentage=(count_male_other/no_of_rows_male_white)*100
print(f'Frequency of male gender who is of other race to cause an accident is {count_male_other} which is {male_other_percentage:.2f}%')
count_male_asian=(df_new_male_accident['Race']=='ASIAN').sum()
male_asian_percentage=(count_male_asian/no_of_rows_male_white)*100
print(f'Frequency of male gender who is of asian race to cause an accident is {count_male_asian} which is {male_asian_percentage:.2f}%')
count_male_native_american=(df_new_male_accident['Race']=='NATIVE AMERICAN').sum()
male_native_american_percentage=(count_male_native_american/no_of_rows_male_white)*100
print(f'Frequency of male gender who is of native american race to cause an accident is {count_male_native_american} which is {male_native_american_percentage:.2f}%')
Frequency of male gender who is of white race to cause an accident is 9421 which is 31.74%
Frequency of male gender who is of black race to cause an accident is 7556 which is 25.46%
Frequency of male gender who is of hispanic race to cause an accident is 9799 which is 33.02%
Frequency of male gender who is of other race to cause an accident is 1500 which is 5.05%
Frequency of male gender who is of asian race to cause an accident is 1333 which is 4.49%
Frequency of male gender who is of native american race to cause an accident is 70 which is 0.24%
#Count female drivers involved in accidents by race and convert to percentages
no_of_rows_female_white=len(df_new_female_accident.index)
count_female_white=(df_new_female_accident['Race']=='WHITE').sum()
female_white_percentage=(count_female_white/no_of_rows_female_white)*100
print(f'Frequency of female gender who is of white race to cause an accident is {count_female_white} which is {female_white_percentage:.2f}%')
count_female_black=(df_new_female_accident['Race']=='BLACK').sum()
female_black_percentage=(count_female_black/no_of_rows_female_white)*100
print(f'Frequency of female gender who is of black race to cause an accident is {count_female_black} which is {female_black_percentage:.2f}%')
count_female_hispanic=(df_new_female_accident['Race']=='HISPANIC').sum()
female_hispanic_percentage=(count_female_hispanic/no_of_rows_female_white)*100
print(f'Frequency of female gender who is of hispanic race to cause an accident is {count_female_hispanic} which is {female_hispanic_percentage:.2f}%')
count_female_other=(df_new_female_accident['Race']=='OTHER').sum()
female_other_percentage=(count_female_other/no_of_rows_female_white)*100
print(f'Frequency of female gender who is of other race to cause an accident is {count_female_other} which is {female_other_percentage:.2f}%')
count_female_asian=(df_new_female_accident['Race']=='ASIAN').sum()
female_asian_percentage=(count_female_asian/no_of_rows_female_white)*100
print(f'Frequency of female gender who is of asian race to cause an accident is {count_female_asian} which is {female_asian_percentage:.2f}%')
count_female_native_american=(df_new_female_accident['Race']=='NATIVE AMERICAN').sum()
female_native_american_percentage=(count_female_native_american/no_of_rows_female_white)*100
print(f'Frequency of female gender who is of native american race to cause an accident is {count_female_native_american} which is {female_native_american_percentage:.2f}%')
Frequency of female gender who is of white race to cause an accident is 5549 which is 41.27%
Frequency of female gender who is of black race to cause an accident is 3484 which is 25.91%
Frequency of female gender who is of hispanic race to cause an accident is 2546 which is 18.93%
Frequency of female gender who is of other race to cause an accident is 809 which is 6.02%
Frequency of female gender who is of asian race to cause an accident is 1041 which is 7.74%
Frequency of female gender who is of native american race to cause an accident is 18 which is 0.13%
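The two repetitive counting cells above can be condensed with value_counts, which returns the same frequencies and, with normalize=True, the same percentages in one call; a toy sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical accident rows for one gender
demo = pd.DataFrame({'Race': ['WHITE', 'HISPANIC', 'WHITE', 'BLACK', 'HISPANIC']})

counts = demo['Race'].value_counts()                    # raw frequencies
pct = demo['Race'].value_counts(normalize=True) * 100   # percentages
for race in counts.index:
    print(f'{race}: {counts[race]} ({pct[race]:.2f}%)')
```

Applied to df_new_male_accident['Race'] and df_new_female_accident['Race'], this would replace the twelve count/percentage blocks with two lines each.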
#Initialize a list with all unique races
data = traffic_violation_df['Race'].unique()
#Create a DataFrame with count of accidents by race and gender
new_df = pd.DataFrame(data, columns=['Race'])
count_accidents_made_by_males = [count_male_black, count_male_white, count_male_hispanic, count_male_other, count_male_asian, count_male_native_american]
count_accidents_made_by_females = [count_female_black, count_female_white, count_female_hispanic, count_female_other, count_female_asian, count_female_native_american]
new_df ['Count_Male_Accidents'] = count_accidents_made_by_males
new_df ['Count_Female_Accidents'] = count_accidents_made_by_females
# Create a pivot table to display the data
new_df_pivot = pd.pivot_table(new_df, values = ['Count_Male_Accidents', 'Count_Female_Accidents'], columns = 'Race')
new_df_pivot
| Race | ASIAN | BLACK | HISPANIC | NATIVE AMERICAN | OTHER | WHITE |
|---|---|---|---|---|---|---|
| Count_Female_Accidents | 1041 | 3484 | 2546 | 18 | 809 | 5549 |
| Count_Male_Accidents | 1333 | 7556 | 9799 | 70 | 1500 | 9421 |
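The same race-by-gender table can be produced directly from the accident rows with pd.crosstab, which avoids the hand-built count lists (and the copy-paste slips they invite); a toy sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical accident rows
demo = pd.DataFrame({
    'Race':   ['WHITE', 'BLACK', 'WHITE', 'HISPANIC', 'BLACK'],
    'Gender': ['M', 'F', 'F', 'M', 'M'],
})

# Rows = gender, columns = race, cells = accident counts
accident_counts = pd.crosstab(demo['Gender'], demo['Race'])
print(accident_counts)
```

On the real data, pd.crosstab applied to the accident-only rows would reproduce the pivot table above in one line.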
#Create a line chart to visualize this data [4]
import plotly.offline as pyo
layout = go.Layout(title = 'Accidents by Race: Male vs Female Counts')
traces =[go.Scatter(
x = new_df_pivot.columns,
y= new_df_pivot.loc[rowname],
mode = 'markers+lines',
name = rowname
)for rowname in new_df_pivot.index]
figure = go.Figure (data = traces, layout=layout)
figure.update_layout(
xaxis_title="Race",
yaxis_title="Count",
legend_title="Legends",
font=dict(
family="Courier New, monospace",
size=18,
color="RebeccaPurple"
))
figure.show()
The observations from the graph are: (i) Hispanic males, followed closely by White males, cause the most accidents, with White females highest among female drivers. (ii) Native American males and females cause the fewest accidents.
Analysis 5: Now that we have seen which factors contribute to an accident, let's predict whether an accident will occur based on these parameters, using a machine learning algorithm
Machine learning uses a dataset to build a model that leverages that data to improve performance and predict new data points. Machine learning models can be categorized as supervised or unsupervised.
Supervised learning is an approach that trains models on labelled datasets to predict outcomes accurately. The model measures its accuracy using input-output pairs. It is further divided into two types:
● Regression – This method models the relationship between the dependent and independent variables and predicts numerical values from data points. The output is continuous.
Below are some commonly used regression models:
Linear Regression – Here, the 'line of best fit' represents the dataset and is found by minimizing the squared error (the squared distance between the line and the points). This helps in predicting future points and identifying outliers.
Decision Tree – Here, the dataset is repeatedly split on a chosen parameter; each node divides into further nodes, and the final nodes where decisions are made are the leaves of the tree. Deeper trees fit the training data more closely, though too many nodes risk overfitting.
Random Forest – Here, multiple decision trees are built on different samples; the forest takes the majority vote for classification and the average for regression.
Neural Network – Here, one or more inputs pass through a network of equations, producing one or more outputs.
● Classification – The output is discrete. Commonly used models are Logistic Regression, Support Vector Machine, Naive Bayes, Decision Tree, Random Forest, and Neural Network.
Unsupervised Learning uses algorithms to analyze and cluster unlabeled data sets. These algorithms discover hidden patterns in data without the need for human intervention.
Now that we have seen how each model works and how they differ, we decided to use the Random Forest model for our predictions. Random Forest handles both continuous and categorical variables, and since we have to predict a categorical variable, it is a good fit and tends to give strong results.
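The supervised fit/predict workflow described above can be illustrated in miniature with scikit-learn; a toy sketch on made-up one-feature data (not the project's features):

```python
from sklearn.ensemble import RandomForestClassifier

# Toy labelled data: one numeric feature, label is 1 when the feature >= 5
X_toy = [[i] for i in range(10)]
y_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_toy, y_toy)      # learn from labelled input-output pairs
print(clf.predict([[9]]))  # predict the label for a new input
```

The same fit-then-predict pattern, with the real features in X and 'Accident' as y, is what the cells below carry out.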
To predict an accident, we use the following parameters: 'Race', 'Gender', 'State', 'Year', 'Make', 'DL State'. The dataset contains more attributes, but they either do not contribute to an accident or are after-effects of one.
We store the parameters responsible for an accident in X, the independent variables, and use y, the dependent variable, for "Accident".
#Define the dependent and independent variables (copy X so the mappings below do not trigger a SettingWithCopyWarning)
X = traffic_violation_df[['Race', 'Gender', 'State', 'Year', 'Make', 'DL State']].copy()
y = traffic_violation_df['Accident']
Since Machine Learning models cannot handle text, we have to map all the values in our attributes to some integers.
Gender - We map all Males to 2, all Females to 1 and all Unknowns become 0
Race - We map Black to 0, White to 1, Hispanic to 2, Asian to 3, Native American to 4 and all others become 5
DL State - We map all Maryland drivers to 1; drivers from all other states become NaN by default. We will handle this in a later step.
#Map the genders to numeric values
gender_categories = {"M": 2, "F": 1, "U":0}
X['Gender']= X['Gender'].map(gender_categories)
#Map the races to numeric values
race_categories = {"BLACK": 0, 'WHITE':1, "HISPANIC": 2, "ASIAN":3, 'NATIVE AMERICAN':4 , 'OTHER':5}
X['Race']= X['Race'].map(race_categories)
#Map the states to numeric values
state_categories = {"MD": 1}
X['State']= X['State'].map(state_categories)
X['DL State']= X['DL State'].map(state_categories)
Now let's look into the vehicles and see which car makes appear most often in the dataset.
X['Make'].value_counts().head(15)
TOYOTA 296795
HONDA 248882
FORD 154847
NISSAN 128866
CHEVROLET 124063
...
VOLKSWAGEN 39107
LEXUS 38787
JEEP 38665
MAZDA 32452
SUBARU 24899
Name: Make, Length: 15, dtype: int64
From the above result, we see that the majority of our dataset involves the top 15 car brands. Hence, for our analysis, we keep only those top 15 brands.
#Map the vehicle make to numeric values
make_categories = {"TOYOTA": 1, 'HONDA':2, 'FORD': 3, 'NISSAN':4, 'CHEVROLET': 5, 'HYUNDAI':6,'DODGE':7, 'ACURA':8, 'MERCEDES':9, 'BMW':10, 'VOLKSWAGEN':11, 'JEEP':12, 'LEXUS':13, 'MAZDA':14, 'SUBARU':15}
X['Make']= X['Make'].map(make_categories)
Since machine learning models cannot handle NaN values, we convert all NaNs to 0 in Make, State, and DL State. We cannot do the same for Year, since a 0 value would confuse the model (the years range between 1900 and 2020), so we fill the missing years with the mean year.
#Fill in the missing values
X['Make'] = X['Make'].fillna(0)
X['State']= X['State'].fillna(0)
X['DL State']= X['DL State'].fillna(0)
X['Year']= X['Year'].fillna(X['Year'].mean())
Let's proceed to split the dataset into training and test data: the training set holds 70% of the data and the test set the remaining 30%.
#Split the data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
Before we initialize the algorithm, we need to choose the number of decision trees. In our case, the accuracy is about the same whether we use 10 or 100, and more trees take longer to process, so we go with 10 as our n_estimators.
#Initialize the ML algorithm
forest = RandomForestClassifier(n_estimators=10)
As our next step, we use the 70% training split to fit the model so that it can then predict the rest.
#Fit the training data into the ML Model
forest.fit(X_train, y_train)
RandomForestClassifier(n_estimators=10)
Since the remaining 30% of our data is for testing, we use the fitted Random Forest model to predict those data points.
#Predict the output
y_pred = forest.predict(X_test)
y_pred
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
Now that we have predicted the test 30% of the data with the Random Forest model, we will compare the predictions with the actual data points to determine the model's accuracy.
#Calculate the accuracy of our model
print(f'Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}% ')
Accuracy: 97.44%
From the above, we can predict an accident with 97.44% accuracy. A confusion matrix can further help us understand the model's performance.
#Generate a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.title('Confusion Matrix')
plot = sns.heatmap(conf_matrix, annot=True, fmt="d")
plot.set(xlabel='Predicted Values', ylabel='Actual Values')
plt.show()
From the confusion matrix, we can see 492,663 violations that were predicted not to result in an accident and did not, and 177 violations that were predicted to result in an accident and did. However, 12,773 violations were predicted not to be accidents but were, and 170 were predicted to be accidents but were not. These errors stem from the fact that only about 4% of the violations in our dataset actually result in an accident. It could be risky to use this model right away, since a potential accident can be predicted as a non-accident, meaning no precautionary measure would be taken. However, we believe that with more accident data, the model would learn more and produce better results in the future.
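Because of that imbalance, accuracy alone overstates the model's usefulness; precision and recall for the accident class, computed directly from the confusion-matrix counts reported above, make the weakness explicit:

```python
# Counts taken from the confusion matrix described above
tn = 492663   # predicted no accident, and no accident occurred
fp = 170      # predicted accident, but none occurred
fn = 12773    # predicted no accident, but one occurred
tp = 177      # predicted accident, and one occurred

precision = tp / (tp + fp)                   # ~0.51: half of accident predictions are right
recall = tp / (tp + fn)                      # ~0.014: we catch very few real accidents
accuracy = (tp + tn) / (tp + tn + fp + fn)   # ~0.9744, matching the score above
print(f'precision={precision:.3f}, recall={recall:.3f}, accuracy={accuracy:.4f}')
```

The recall of roughly 1.4% quantifies the risk noted above: despite 97% accuracy, the model misses almost all real accidents.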
#Find the output of our model based on a given set of inputs
race = 'HISPANIC'
gender = 'M'
state = 'MD'
year = 2006
make = 'TOYOTA'
dl_state = 'MD'
test_data = []
race_map = {"BLACK": 0, 'WHITE': 1, "HISPANIC": 2, "ASIAN": 3, 'NATIVE AMERICAN': 4, 'OTHER': 5}
if race in race_map:
    test_data.append(race_map[race])
gender_map = {"M": 2, "F": 1, "U": 0}
if gender in gender_map:
    test_data.append(gender_map[gender])
state_map = {"MD": 1}
if state in state_map:
    test_data.append(state_map[state])
else:
    test_data.append(0)
test_data.append(year)
#The model was trained on columns in the order Race, Gender, State, Year, Make, DL State,
#so Make must be appended before DL State
model_map = {"TOYOTA": 1, 'HONDA': 2, 'FORD': 3, 'NISSAN': 4, 'CHEVROLET': 5, 'HYUNDAI': 6, 'DODGE': 7, 'ACURA': 8, 'MERCEDES': 9, 'BMW': 10, 'VOLKSWAGEN': 11, 'JEEP': 12, 'LEXUS': 13, 'MAZDA': 14, 'SUBARU': 15}
if make in model_map:
    test_data.append(model_map[make])
else:
    test_data.append(0)
if dl_state in state_map:
    test_data.append(state_map[dl_state])
else:
    test_data.append(0)
y_pred = forest.predict([test_data])
gender_map = {"M": "Male", "F": "Female", "U": "Unknown"}
if gender in gender_map:
    gender = gender_map[gender]
accident_map = {0: " not", 1: ""}
decision = accident_map[y_pred[0]]
print(f'When a {gender}, {race} driver from {state} with a driving licence from {dl_state} is driving a {year} model {make}, the driver is{decision} likely to cause an accident')
When a Male, HISPANIC driver from MD with a driving licence from MD is driving a 2006 model TOYOTA, the driver is not likely to cause an accident
Conclusion
Through our analysis, the authorities will find several things helpful.
1) We were able to see which locations had the most accidents, so the police can be more vigilant and strengthen traffic enforcement in those areas.
2) Racial bias should be reduced, and strict measures should be taken against any police officer who makes accusations based solely on race. This will help the community feel free and safe.
3) The female warning rate is higher than the male warning rate, which shows a gender disparity in the warnings given to reckless drivers; officers may be treating female drivers more leniently, and the department should look into it.
4) Hispanic males cause the most accidents of any race/gender combination, so the department should look further into why and address it.
5) The predictions help anticipate which combinations of make, model year, race, and gender are likely to be involved in a future accident. This can also reveal trends: if a particular make or model is involved in an outsized share of accidents, that vehicle may have an issue.
6) This analysis will help the community be safe and hopefully help reduce accidents.